012_subtransactions.pl vs clang -fsanitize=undefined

  • Jump to comment-1
    pryzby@telsasoft.com2022-07-24T20:59:37+00:00
    I found that -fsanitize causes the test to fail, going back to REL_10_STABLE, for any clang in: 1:11.1.0-6 1:12.0.1-19ubuntu3 1:13.0.1-2ubuntu2 1:14.0.0-1ubuntu1 | time ./configure --enable-cassert --enable-debug --enable-tap-tests --with-CC=clang-13 CFLAGS='-fsanitize=undefined' | time { make -j4 clean; make -j4; } >/dev/null | time PROVE_TESTS=t/012_subtransactions.pl make -C ./src/test/recovery check | | t/012_subtransactions.pl .. 2/? | # Failed test 'Visible' | # at t/012_subtransactions.pl line 111. | # got: '-1' | # expected: '8128' | ... | # Looks like you failed 6 tests of 12. I haven't found any combination of options which cause it to fail differently, so I'm not even sure if the problem is in postgres, the test case, clang or libubsan. Note that optimization seems to avoid the problem, which is why "kestrel" shows no issue. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2022-07-23%2022%3A17%3A48 -- Justin
    • Jump to comment-1
      thomas.munro@gmail.com2022-07-24T21:19:37+00:00
      On Mon, Jul 25, 2022 at 8:59 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > I found that -fsanitize causes the test to fail, going back to REL_10_STABLE, > for any clang in: > > 1:11.1.0-6 > 1:12.0.1-19ubuntu3 > 1:13.0.1-2ubuntu2 > 1:14.0.0-1ubuntu1 > > | time ./configure --enable-cassert --enable-debug --enable-tap-tests --with-CC=clang-13 CFLAGS='-fsanitize=undefined' > | time { make -j4 clean; make -j4; } >/dev/null > | time PROVE_TESTS=t/012_subtransactions.pl make -C ./src/test/recovery check > | > | t/012_subtransactions.pl .. 2/? > | # Failed test 'Visible' > | # at t/012_subtransactions.pl line 111. > | # got: '-1' > | # expected: '8128' > | ... > | # Looks like you failed 6 tests of 12. > > I haven't found any combination of options which cause it to fail differently, > so I'm not even sure if the problem is in postgres, the test case, clang or > libubsan. Note that optimization seems to avoid the problem, which is why > "kestrel" shows no issue. > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2022-07-23%2022%3A17%3A48 Yeah I've seen this too... it'd be good to figure out how to fix it: https://www.postgresql.org/message-id/CA%2BhUKGLDA-GuQKRvDF3abHadDrrYZ33N9e4DEOGwKH3JqdYSCQ%40mail.gmail.com
      • Jump to comment-1
        tgl@sss.pgh.pa.us2022-07-24T21:50:59+00:00
        Thomas Munro <thomas.munro@gmail.com> writes: > On Mon, Jul 25, 2022 at 8:59 AM Justin Pryzby <pryzby@telsasoft.com> wrote: >> I found that -fsanitize causes the test to fail, going back to REL_10_STABLE, >> for any clang in: > Yeah I've seen this too... it'd be good to figure out how to fix it: > https://www.postgresql.org/message-id/CA%2BhUKGLDA-GuQKRvDF3abHadDrrYZ33N9e4DEOGwKH3JqdYSCQ%40mail.gmail.com Yeah, reproduces here too with RHEL8's clang 13.0.1. I also see that the failures are due to "stack depth exceeded" errors from that recursive hs_subxids() function. As best I can tell, the stack depth failure is entirely honest: (gdb) p stack_base_ptr $1 = 0x7ffd92032100 "\360[\\\006" (gdb) p $sp $2 = (void *) 0x7ffd91e305a0 (gdb) p 0x7ffd92032100 - 0x7ffd91e305a0 $3 = 2104160 I can get at most 82 recursion levels without failure. With my normal build it can get to 678 levels before dying. I think what's happening is just that this build configuration eats stack extravagantly. Maybe it keeps all locals on the stack instead of in registers? I'm too lazy to check out the assembly code. I thought for a moment of rewriting hs_subxids() to iterate instead of recurse, but it doesn't look like that's tremendously feasible in plpgsql --- the only way to make nested subtransactions is to recurse. It could probably be done from the client by issuing a series of SAVEPOINT commands, but not nearly as elegantly. regards, tom lane
        • Jump to comment-1
          tgl@sss.pgh.pa.us2022-07-24T22:18:03+00:00
          I wrote: > I think what's happening is just that this build configuration > eats stack extravagantly. That's definitely it, but I don't entirely see why. Here are a couple of major offenders though: (gdb) x/8i ExecInterpExpr 0x11a5530 <ExecInterpExpr>: push %rbp 0x11a5531 <ExecInterpExpr+1>: mov %rsp,%rbp 0x11a5534 <ExecInterpExpr+4>: sub $0x2f40,%rsp 0x11a553b <ExecInterpExpr+11>: mov %rdi,-0x10(%rbp) 0x11a553f <ExecInterpExpr+15>: mov %rsi,-0x18(%rbp) 0x11a5543 <ExecInterpExpr+19>: mov %rdx,-0x20(%rbp) 0x11a5547 <ExecInterpExpr+23>: jmpq 0x11a554c <ExecInterpExpr+28> 0x11a554c <ExecInterpExpr+28>: cmpq $0x0,-0x10(%rbp) (gdb) p 0x2f40 $51 = 12096 (gdb) x/8i ExecInitExprRec 0x11672e0 <ExecInitExprRec>: push %rbp 0x11672e1 <ExecInitExprRec+1>: mov %rsp,%rbp 0x11672e4 <ExecInitExprRec+4>: sub $0x3c80,%rsp 0x11672eb <ExecInitExprRec+11>: mov %rdi,-0x8(%rbp) 0x11672ef <ExecInitExprRec+15>: mov %rsi,-0x10(%rbp) 0x11672f3 <ExecInitExprRec+19>: mov %rdx,-0x18(%rbp) 0x11672f7 <ExecInitExprRec+23>: mov %rcx,-0x20(%rbp) 0x11672fb <ExecInitExprRec+27>: lea -0x60(%rbp),%rdi (gdb) p 0x3c80 $52 = 15488 It looks like this build eats about 24K of stack per plpgsql recursion level, of which ExecInterpExpr alone accounts for half. Why is that? It has no large local variables, mostly just ints and pointers. There are a lot of them, but even if you presume that each one gets its own dedicated bit of the stack frame, it's hard to arrive at 12K. I'd almost call this a compiler bug. regards, tom lane
          • Jump to comment-1
            thomas.munro@gmail.com2022-07-24T22:39:25+00:00
            On Mon, Jul 25, 2022 at 10:18 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: > > I think what's happening is just that this build configuration > > eats stack extravagantly. > > That's definitely it, but I don't entirely see why. Here are a > couple of major offenders though: Interesting. I wonder where we can read about what stuff clang puts on the stack to implement the undefined behaviour checker (and what GCC does differently here), but today I will resist the urge to go looking. As for workarounds (and as a note for my future self next time I'm testing with UBSan), this is enough for the test to pass on my dev box (4MB is not enough): --- a/src/test/recovery/t/012_subtransactions.pl +++ b/src/test/recovery/t/012_subtransactions.pl @@ -16,6 +16,7 @@ $node_primary->append_conf( 'postgresql.conf', qq( max_prepared_transactions = 10 log_checkpoints = true + max_stack_depth = 5MB )); It's also possible to tell it to keep out of certain functions: https://github.com/llvm/llvm-project/blob/main/clang/docs/UndefinedBehaviorSanitizer.rst#disabling-instrumentation-with-attribute-no-sanitize-undefined